PhDs Awarded by Field Over Time

In this demo, we’ll first work with a dataset on the number of PhD degrees awarded in the US from TidyTuesday.

# Read in the tidytuesday data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
phd_field <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-19/phd_by_field.csv")
## Rows: 3370 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): broad_field, major_field, field
## dbl (2): year, n_phds
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
phd_field
## # A tibble: 3,370 × 5
##    broad_field   major_field                                 field   year n_phds
##    <chr>         <chr>                                       <chr>  <dbl>  <dbl>
##  1 Life sciences Agricultural sciences and natural resources Agric…  2008    111
##  2 Life sciences Agricultural sciences and natural resources Agric…  2008     28
##  3 Life sciences Agricultural sciences and natural resources Agric…  2008      3
##  4 Life sciences Agricultural sciences and natural resources Agron…  2008     68
##  5 Life sciences Agricultural sciences and natural resources Anima…  2008     41
##  6 Life sciences Agricultural sciences and natural resources Anima…  2008     18
##  7 Life sciences Agricultural sciences and natural resources Anima…  2008     77
##  8 Life sciences Agricultural sciences and natural resources Envir…  2008    182
##  9 Life sciences Agricultural sciences and natural resources Fishi…  2008     52
## 10 Life sciences Agricultural sciences and natural resources Food …  2008     96
## # … with 3,360 more rows

Let’s start by grabbing the rows corresponding to Statistics PhDs. While there are a number of ways to do this, we can grab field containing “statistics” (including biostatistics) with the str_detect() function.

stats_phds <- phd_field %>%
  filter(str_detect(tolower(field), "statistics"))

What are the different fields that were captured?

table(stats_phds$field)
## 
##                        Biometrics and biostatistics 
##                                                  10 
##            Educational statistics, research methods 
##                                                  10 
## Management information systems, business statistics 
##                                                  10 
##                 Mathematics and statistics, general 
##                                                  10 
##                   Mathematics and statistics, other 
##                                                  10 
##                            Statistics (mathematics) 
##                                                  10 
##                        Statistics (social sciences) 
##                                                  10

To start, let’s just summarize the number of PhDs by year:

stat_phd_year_summary <- stats_phds %>%
  group_by(year) %>%
  summarize(n_phds = sum(n_phds))

Now, we’ll make the typical scatterplot display with n_phds on the y-axis and year on the x-axis:

stat_phd_year_summary %>%
  ggplot(aes(x = year, y = n_phds)) +
  geom_point() +
  theme_bw() +
  labs(x = "Year", y = "Number of PhDs",
       title = "Number of Statistics-related PhDs awarded over time")

We should fix our x-axis here and make the breaks more informative. In this case, I’ll change it so each year is labeled (that may not be appropriate for every visual but it works out here).

stat_phd_year_summary %>%
  ggplot(aes(x = year, y = n_phds)) +
  geom_point() +
  # Modify the x-axis to make the axis breaks at the unique years and show their
  # respective labels
  scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
                     labels = unique(stat_phd_year_summary$year)) +
  theme_bw() +
  labs(x = "Year", y = "Number of PhDs",
       title = "Number of Statistics-related PhDs awarded over time")

To emphasize the ordering of the year along the x-axis, we’ll add a line connecting the points to emphasize the order:

stat_phd_year_summary %>%
  ggplot(aes(x = year, y = n_phds)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
                     labels = unique(stat_phd_year_summary$year)) +
  theme_bw() +
  labs(x = "Year", y = "Number of PhDs",
       title = "Number of Statistics-related PhDs awarded over time")

We can drop the points, leaving only the connecting lines to emphasize trends:

stat_phd_year_summary %>%
  ggplot(aes(x = year, y = n_phds)) +
  geom_line() +
  scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
                     labels = unique(stat_phd_year_summary$year)) +
  theme_bw() +
  labs(x = "Year", y = "Number of PhDs",
       title = "Number of Statistics-related PhDs awarded over time")

Another common way to display trends is by filling in the area under the line. However, this is only appropriate when the y-axis starts at 0! It’s also redundant use of ink so just be careful when deciding whether or not to fill the area. We can fill the area under the line with the geom_area() aesthetic - but note that it changes the y-axis by default to start at 0:

stat_phd_year_summary %>%
  ggplot(aes(x = year, y = n_phds)) +
  # Fill the area under the line
  geom_area(fill = "darkblue", alpha = 0.5) +
  geom_line() +
  scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
                     labels = unique(stat_phd_year_summary$year)) +
  theme_bw() +
  labs(x = "Year", y = "Number of PhDs",
       title = "Number of Statistics-related PhDs awarded over time")

You can also make this plot using the ggridges package.

Plotting and labeling several lines

We’ll now switch to displaying the different Statistics fields separately with the stats_phds dataset. First, we should NOT display multiple time series with just points as follows:

stats_phds %>%
  ggplot(aes(x = year, y = n_phds, color = field)) +
  geom_point() +
  scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
                     labels = unique(stat_phd_year_summary$year)) +
  theme_bw() +
  theme(legend.position = "bottom",
        # Adjust the size of the legend's text
        legend.text = element_text(size = 5),
        legend.title = element_text(size = 6)) +
  labs(x = "Year", y = "Number of PhDs",
       title = "Number of Statistics-related PhDs awarded over time",
       color = "Field")

It’s much simpler to just display the lines to compare the trends:

stats_phds %>%
  ggplot(aes(x = year, y = n_phds, color = field)) +
  geom_line() +
  scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
                     labels = unique(stat_phd_year_summary$year)) +
  theme_bw() +
  theme(legend.position = "bottom",
        # Adjust the size of the legend's text
        legend.text = element_text(size = 5),
        legend.title = element_text(size = 6)) +
  labs(x = "Year", y = "Number of PhDs",
       title = "Number of Statistics-related PhDs awarded over time",
       color = "Field")

The legend is pretty cluttered though, instead we can directly label the displayed lines using the ggrepel package. We first need to create a dataset with just the final values (which in this case corresponds to year == 2017), and then add labels for these values. To make the labels visible, we need to increase our x-axis limits. Note that this is a “hack”, but you will rely on hacks to customize visuals in the future… The following code chunk demonstrates how to do this:

stats_phds_2017 <- stats_phds %>%
  filter(year == 2017)

# Access the ggrepel package:
# install.packages("ggrepel")
library(ggrepel)
stats_phds %>%
  ggplot(aes(x = year, y = n_phds, color = field)) +
  geom_line() +
  # Add the labels:
  geom_text_repel(data = stats_phds_2017,
                  aes(label = field),
                  size = 2, 
                  # Drop the segment connection:
                  segment.color = NA, 
                  # Move labels up or down based on overlap
                  direction = "y",
                  # Try to align the labels horizontally on the left hand side
                  hjust = "left") +
  scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
                     labels = unique(stat_phd_year_summary$year),
                     # Update the limits so that there is some padding on the
                     # x-axis but don't label the new maximum
                     limits = c(min(stat_phd_year_summary$year),
                                max(stat_phd_year_summary$year) + 3)) +
  theme_bw() +
  # Drop the legend
  theme(legend.position = "none") +
  labs(x = "Year", y = "Number of PhDs",
       title = "Number of Statistics-related PhDs awarded over time",
       color = "Field")

Next, let’s switch to back to the original dataset phd_field. What happens if we plot a line for every field attempting to use the color aesthetic to separate them?

phd_field %>%
  ggplot(aes(x = year, y = n_phds, color = field)) +
  geom_line() +
  scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
                     labels = unique(stat_phd_year_summary$year)) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Year", y = "Number of PhDs",
       title = "Number of Statistics-related PhDs awarded over time",
       color = "Field")
## Warning: Removed 270 rows containing missing values (`geom_line()`).

The plot above is obviously a disaster… When we are dealing with potentially way too many categories, we can instead highlight lines of interest while setting the background lines to gray, so we can still see background trends. We need to use the group aesthetic to split the gray lines from each other. Plus, we should adjust the alpha due to the overlap. The following code chunk demonstrates how to do this for highlighting the “Statistics (mathematics)” and “Biometrics and biostatistics” lines. We essentially create separate plot layers by filtering on the field variable:

# First display the background lines using the full dataset with those two fields 
# filtered out:
phd_field %>%
  # The following line says: NOT (field in c("Biometrics and biostatistics", "Statistics (mathematics)"))
  filter(!(field %in% c("Biometrics and biostatistics", 
                        "Statistics (mathematics)"))) %>%
  ggplot() +
  # Add the background lines - need to specify the group to be the field
  geom_line(aes(x = year, y = n_phds, group = field),
            color = "gray", size = .5, alpha = .5) +
  # Now add the layer with the lines of interest:
  geom_line(data = filter(phd_field,
                          # Note this is just the opposite of the above since ! is removed
                          field %in% c("Biometrics and biostatistics", 
                                       "Statistics (mathematics)")),
            aes(x = year, y = n_phds, color = field),
            # Make the size larger
            size = .75, alpha = 1) +
  scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
                     labels = unique(stat_phd_year_summary$year)) +
  theme_bw() +
  theme(legend.position = "bottom", 
        # Drop the panel lines making the gray difficult to see
        panel.grid = element_blank()) +
  labs(x = "Year", y = "Number of PhDs",
       title = "Number of Statistics-related PhDs awarded over time",
       color = "Field")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## Warning: Removed 270 rows containing missing values (`geom_line()`).

Florence Nightingale’s rose diagrams

Another way to visualize time series data is to display it in a cycle pattern, using polar coordinates, as done by Florence Nightingale’s famous rose diagram. We can recreate the rose diagram by accessing the data in the HistData package. We’ll first load and print out the first so many rows of the data below:

library(HistData)
head(Nightingale)
##         Date Month Year  Army Disease Wounds Other Disease.rate Wounds.rate
## 1 1854-04-01   Apr 1854  8571       1      0     5          1.4         0.0
## 2 1854-05-01   May 1854 23333      12      0     9          6.2         0.0
## 3 1854-06-01   Jun 1854 28333      11      0     6          4.7         0.0
## 4 1854-07-01   Jul 1854 28722     359      0    23        150.0         0.0
## 5 1854-08-01   Aug 1854 30246     828      1    30        328.5         0.4
## 6 1854-09-01   Sep 1854 30290     788     81    70        312.2        32.1
##   Other.rate
## 1        7.0
## 2        4.6
## 3        2.5
## 4        9.6
## 5       11.9
## 6       27.7

To recreate the plot, we’ll need to first make a longer version of the dataset with the Disease, Wounds, and Other columns separated into three rows. To do that, we’ll use the pivot_longer() function after just selecting the columns of interest for our plot:

crimean_war_data <- Nightingale %>%
  dplyr::select(Date, Month, Year, Disease, Wounds, Other) %>%
  # Now pivot those columns to take up separate rows:
  pivot_longer(Disease:Other,
               names_to = "cause", values_to = "count")

Next, we’ll make a label column matching Nightingale’s plot based on the Date column. We’ll talk about dates more in next lecture, but we can condition on being above or below certain dates in a natural way:

crimean_war_data <- crimean_war_data %>%
  mutate(time_period = ifelse(Date <= as.Date("1855-03-01"),
                              "April 1854 to March 1855", 
                              "April 1855 to March 1856"))

And finally we can go ahead and display the rose diagram facetted by the time period (using similar colors to Nightingale):

crimean_war_data %>% 
  ggplot(aes(x = Month, y = count)) + 
  geom_col(aes(fill = cause), width = 1, 
           position = "identity", alpha = 0.5) + 
  coord_polar() + 
  facet_wrap(~ time_period, ncol = 2) +
  scale_fill_manual(values = c("skyblue3", "grey30", "firebrick")) +
  scale_y_sqrt() +
  theme_void() +
  # All of this below is to just customize the theme in a way that we are
  # close to resembling the original plot (ie lets make it look old!)
  theme(axis.text.x = element_text(size = 9),
        strip.text = element_text(size = 11),
        legend.position = "bottom",
        plot.background = element_rect(fill = alpha("cornsilk", 0.5)),
        plot.margin = unit(c(10, 10, 10, 10), "pt"),
        plot.title = element_text(vjust = 5)) +
  labs(title = "Diagram of the Causes of Mortality in the Army in the East")

This looks pretty close to the original diagram, except the order of the months does not match the original. We can of course change that by reordering the factor variable:

crimean_war_data %>% 
  # Manually relevel it to match the original plot
  mutate(Month = fct_relevel(Month, 
                             "Jul", "Aug", "Sep", "Oct", "Nov",
                             "Dec", "Jan", "Feb", "Mar", "Apr", "May", "Jun")) %>%
  ggplot(aes(x = Month, y = count)) + 
  geom_col(aes(fill = cause), width = 1, 
           position = "identity", alpha = 0.5) + 
  coord_polar() + 
  facet_wrap(~ time_period, ncol = 2) +
  scale_fill_manual(values = c("skyblue3", "grey30", "firebrick")) +
  scale_y_sqrt() +
  theme_void() +
  # All of this below is to just customize the theme in a way that we are
  # close to resembling the original plot (ie lets make it look old!)
  theme(axis.text.x = element_text(size = 9),
        strip.text = element_text(size = 11),
        legend.position = "bottom",
        plot.background = element_rect(fill = alpha("cornsilk", 0.5)),
        plot.margin = unit(c(10, 10, 10, 10), "pt"),
        plot.title = element_text(vjust = 5)) +
  labs(title = "Diagram of the Causes of Mortality in the Army in the East")

How does this compare to just a simple line graph?

crimean_war_data %>% 
  ggplot(aes(x = Date, y = count, color = cause)) + 
  geom_line() +
  # Add a reference line at the cutoff point
  geom_vline(xintercept = as.Date("1855-03-01"), linetype = "dashed",
             color = "gray") +
  scale_color_manual(values = c("skyblue3", "grey30", "firebrick")) +
  scale_y_sqrt() +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "Diagram of the Causes of Mortality in the Army in the East",
       y = "sqrt(counts)", x = "Date")

We can customize the x-axis further using scale_x_date():

crimean_war_data %>% 
  ggplot(aes(x = Date, y = count, color = cause)) + 
  geom_line() +
  # Add a reference line at the cutoff point
  geom_vline(xintercept = as.Date("1855-03-01"), linetype = "dashed",
             color = "gray") +
  scale_color_manual(values = c("skyblue3", "grey30", "firebrick")) +
  scale_y_sqrt() +
  # Format to use abbreviate month %b with year %Y
  scale_x_date(date_labels = "%b %Y") +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "Diagram of the Causes of Mortality in the Army in the East",
       y = "sqrt(counts)", x = "Date")

Which one do you prefer? Maybe filling the area under the lines would be better here…

Animations

Use gganimate to add animations

By far, the simplest way to create visualizations with animations is to use the gganimate package. This effectively works as an extension to ggplot figures but with the inclusion of various transition_* functions

When should we animate plots?

First, let’s think about when you should NOT animate a plot. We first create a visualization of the penguins data from before, of bill length on the y-axis against the body mass on the x-axis colored by species:

library(palmerpenguins)

penguins %>% 
  ggplot(aes(x = body_mass_g, y = bill_length_mm, color = species)) +
  geom_point(alpha = 0.5, size = 2) +
  labs(x = "Body Mass (g)", y = "Bill Length (mm)") +
  theme_bw()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Now, we could do the following: use the gganimate package to only display one species at a time with the transition_states() function:

library(gganimate)
penguins %>% 
  ggplot(aes(x = body_mass_g, y = bill_length_mm, color = species)) +
  geom_point(alpha = 0.5, size = 2) +
  labs(x = "Body Mass (g)", y = "Bill Length (mm)") +
  theme_bw() +
  transition_states(species,
                    transition_length = 0.5,
                    state_length = 1)

The use of transition_length and state_length indicate how much relative time should take place when transitioning between states and the pause at each state, respectively. But the above use of animation is useless!

So when should you consider using animation?

One appropriate usage is in the context of storytelling with data, to emphasize some aspect of your visual display. For instance, we’ll borrow this F1 racing dataset from Meghan Hall’s iteration of 36-315 to compare the performance of three racing teams:

# First load the data from Meghan's github
f1_data_ex <- read_csv('https://raw.githubusercontent.com/meghall06/CMU-36-315-site/main/data/constructor_pts.csv') %>%
  filter(name %in% c("McLaren", "Renault", "Racing Point"),
         year == 2020)
## Rows: 1260 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): name, race_name
## dbl  (9): constructorStandingsId, raceId, constructorId, points, position, p...
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Now display the results across the rounds:
f1_data_ex %>%
  ggplot(aes(x = round, y = points, group = name, color = name)) +
  geom_line(size = 2) +
  scale_x_continuous(breaks = seq(1, 17, 1)) +
  labs(title = "The race for third place in the 2020 F1 season",
       y = "Accumulated points", x = NULL) +
  theme_bw()

From above we can see the accumulated points increasing over time for each team, with McLaren finishing better than both, Racing Point and Renault, at the end. But we could incrementally reveal the results at each stage emphasize the story of progression. We’re not adding another dimension to the display, but we emphasize the intermediate results through animation with the transition_reveal() function:

f1_data_ex %>%
  ggplot(aes(x = round, y = points, group = name, color = name)) +
  geom_line(size = 2) +
  scale_x_continuous(breaks = seq(1, 17, 1)) +
  labs(title = "The race for third place in the 2020 F1 season",
       y = "Accumulated points", x = NULL) +
  theme_bw() +
  # Reveal the results by round
  transition_reveal(round)
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

The most effective use of animation is when it adds another dimension to your visualization, typically in the form of time. The previous visualization only animated across the x-axis - it did NOT add another variable in our data. However, animation can let us bring in another dimension so that we can see differences between relationships of variables in various ways. You should watch Hans Rosling’s 200 Countries, 200 Years, 4 Minutes to see one example in action. We can make similar visualizations with gganimate.

In the code chunk below, we’re going to display yearly summaries about housing sales in TX (dataset comes loaded with ggplot2). We’re going to plot the average number of active listings and average median sale price for each city-year combination in the data. For context, we’re going to highlight the data for Houston in red with a larger point size:

# Load the scales package for better labeling of the axes
txhousing %>% 
  group_by(city, year) %>% 
  summarize(median = mean(median, na.rm = TRUE),
            listings = mean(listings, na.rm = TRUE)) %>% 
  ggplot(aes(x = median, y = listings, 
             color = (city == "Houston"),
             size = (city == "Houston"))) +
  # Hide the legend for the point layer
  geom_point(alpha = 0.5, show.legend = FALSE) +
  # Manual color label
  scale_color_manual(values = c("black", "darkred")) +
  # Manual size adjustment
  scale_size_manual(values = c(2, 4)) +
  scale_x_continuous(labels = scales::dollar, name = "Median Price") +
  scale_y_continuous(labels = scales::label_number_si()) +
  theme_bw() +
  labs(x = "Median Price", y = "Avg. of Monthly Listings",
       subtitle = "Houston in red")
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
## Warning: `label_number_si()` was deprecated in scales 1.2.0.
## ℹ Please use the `scale_cut` argument of `label_number()` instead.
## Warning: Removed 68 rows containing missing values (`geom_point()`).

In the figure above we do not have year included in any way. But we can use the transition_time() function to animate the visual over time, while also updating the plot title to include the displayed year:

# Load the scales package for better labeling of the axes
txhousing %>% 
  group_by(city, year) %>% 
  summarize(median = mean(median, na.rm = TRUE),
            listings = mean(listings, na.rm = TRUE)) %>% 
  ggplot(aes(x = median, y = listings, 
             color = (city == "Houston"),
             size = (city == "Houston"))) +
  # Hide the legend for the point layer
  geom_point(alpha = 0.5, show.legend = FALSE) +
  # Manual color label
  scale_color_manual(values = c("black", "darkred")) +
  # Manual size adjustment
  scale_size_manual(values = c(2, 4)) +
  scale_x_continuous(labels = scales::dollar, name = "Median Price") +
  scale_y_continuous(labels = scales::label_number_si()) +
  theme_bw() +
  labs(x = "Median Price", y = "Avg. of Monthly Listings",
       subtitle = "Houston in red", 
       title = "Year: {frame_time}") +
  transition_time(year)
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.

From viewing the above visual, you can see how animation makes changes appear more dramatic between years - versus plotting each year separately with facets. We can then save the above animation as a GIF with the anim_save("INSERT/FILEPATH") function, which will save the last animation you made by default.

anim_save("examples/txhousing.gif")

Some key points to think about before adding animation to a visualization:

  1. Always make and describe the original / base graphic first that does NOT include animation.

  2. Before adding animation to the graph, ask yourself: How would animation give you additional insights about the data that you would otherwise not be able to?

  3. Never add animation just because it’s cool!

  4. When presenting, make sure you explain exactly what is being displayed with animation and what within the animation you want to emphasize. This will help you determine if animation is actually worth including.